17 research outputs found

    Handwritten and Printed Text Separation in Real Document

    The aim of the paper is to separate handwritten and printed text in real documents that contain noise and graphics, including annotations. Pseudo-lines and pseudo-words, extracted with the run-length smoothing algorithm (RLSA), serve as the basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel first labels each pseudo-word, taking its local neighbourhood into account. The context is then propagated between neighbours to correct possible labelling errors. To limit running time, we propose linear-complexity methods based on constrained k-NN. With a kd-tree, the running time is almost linearly proportional to the number of pseudo-words. The performance of our system is close to 90%, even when a very small learning dataset is used, where samples are basically composed of complex administrative documents. Comment: Machine Vision Applications (2013)
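
    As an illustration of the smoothing step named in this abstract, here is a minimal sketch of horizontal run-length smoothing on a binary image. It assumes that 1 marks ink and 0 marks background, and that only background runs enclosed by ink on both sides are filled; the function names and the threshold value are hypothetical and are not taken from the paper.

```python
def rlsa_row(row, threshold):
    """Horizontal run-length smoothing of one binary row (1 = ink, 0 = background)."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            # Fill the background gap since the previous ink pixel if it is short enough.
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply the row-wise smoothing to every row of a binary image (list of lists)."""
    return [rlsa_row(row, threshold) for row in image]


if __name__ == "__main__":
    # The 2-pixel gap is merged; the 4-pixel gap is kept.
    print(rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], threshold=2))
```

    Connected regions of the smoothed image can then be read as candidate pseudo-words or pseudo-lines, depending on how large the threshold is.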

    Receipt Dataset for Fraud Detection

    The aim of this paper is to introduce a new dataset initially created for work on fraud detection in documents. The dataset is composed of 1969 images of receipts and the associated OCR result for each. The article details the dataset and its interest for the document analysis community. We share this dataset with the community as a benchmark for the evaluation of fraud detection approaches.

    Administrative Document Analysis and Structure

    This chapter reports our knowledge about the analysis and recognition of scanned administrative documents. Because the administrative paper flow brings new and continuous arrivals, conventional techniques designed for modeling and recognition over static databases are doomed to failure. For this purpose, a new experience-based technique was investigated, with very promising results. This technique is related to case-based reasoning, already used in data mining and various machine learning problems. After presenting the context of the administrative document flow and its real-time processing requirements, we present a case-based reasoning approach for invoice processing. A case corresponds to the co-existence of a problem and its solution. The problem in an invoice corresponds to a local structure, such as the keywords of an address or the line patterns in the amounts table, while the solution is related to their content. This problem is then compared to a document case base using graph probing. For this purpose, we proposed an improvement of an already existing neural network called Incremental Growing Neural Ga

    Human-document interaction systems: a new frontier for document image analysis

    © 2016 IEEE. All indications show that paper documents will not cede in favour of their digital counterparts, but will instead be used increasingly in conjunction with digital information. An open challenge is how to seamlessly link the physical with the digital: how to continue taking advantage of the important affordances of paper without missing out on digital functionality. This paper presents the authors' experience with developing systems for Human-Document Interaction based on augmented document interfaces and examines new challenges and opportunities arising for the document image analysis field in this area. The system presented combines state-of-the-art camera-based document image analysis techniques with a range of complementary technologies to offer fluid Human-Document Interaction. Both fixed and nomadic setups that have gone through user testing in real-life environments are discussed, and use cases are presented that span the spectrum from business to educational applications. Peer reviewed. Postprint (author's final draft)

    Incremental knowledge system for document understanding and fraud detection

    Document Understanding is the Artificial Intelligence discipline that gives machines the ability to read documents. Globally, it aims to understand the function of a document and its class; locally, it aims to understand specific details such as entities. The scientific challenge is to recognize more than 90% of the data, while the industrial challenge is to reach this performance with the least human effort to train the machine. This thesis argues that incremental learning methods can meet both challenges. The proposals enable efficient, iterative training with very few document samples. For the classification task, we demonstrate (1) the continuous learning of textual descriptors, (2) the benefit of taking the discourse sequence into account, and (3) the benefit of integrating into the knowledge model an episodic memory made of a few sample Souvenirs. For the data extraction task, we demonstrate an iterative structural model, based on a star-graph representation, whose robustness is improved by embedding a little general a priori knowledge. Aware of the economic and societal impact of document fraud, this thesis also addresses that issue. Our modest contribution is to study the different fraud categories in order to open further research. This research was carried out in an unconventional framework, in conjunction with industrial activity at Yooz and collaborative research projects, in particular the FEDER SECURDOC project supported by the Nouvelle-Aquitaine region and the Labcom IDEAS supported by the ANR.

    Segmentation de flux de documents Application aux documents administratifs

    This paper proposes a supervised approach to document flow segmentation. The algorithm treats the flow of documents as a sequence of pairs of consecutive pages and examines the relationship between them in order to detect a document continuity or a rupture. First, descriptors are extracted from the pages, and an approach is proposed to merge these descriptors into a single vector that models the relationship between the two pages of a pair. This representation is passed to a binary classifier that labels it as either a rupture (synonymous with segmentation) or a continuity. In case of a rupture, we consider that the limit of a complete document has been reached, and the stream analysis continues by starting a new document. In case of continuity, the two pages are considered to belong to the same document. If the class of the limit is uncertain, a rejection is decided and the pages analyzed up to that point are considered a "fragment", which amounts to an over-segmentation. The classification gives good results, approaching 90% on certain documents, which is high at this level of the system. Keywords: document flow segmentation, textual descriptors, classification
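
    The decision logic described in this abstract (rupture, continuity, or rejection for each pair of consecutive pages) can be sketched as follows. This is only an illustration under stated assumptions: the classifier is replaced by a list of precomputed rupture scores in [0, 1], the two thresholds `low` and `high` are hypothetical, and the end of the stream is treated as a rupture.

```python
def segment_stream(n_pages, rupture_scores, low=0.3, high=0.7):
    """Split a page stream into (first_page, last_page, kind) segments.

    rupture_scores[i] is an assumed classifier score for the boundary between
    page i and page i + 1. Scores >= high are ruptures, scores <= low are
    continuities, and anything in between is rejected as uncertain.
    """
    if len(rupture_scores) != max(n_pages - 1, 0):
        raise ValueError("expected one score per pair of consecutive pages")
    segments = []
    start = 0
    for i, score in enumerate(rupture_scores):
        if score >= high:
            # Confident rupture: the pages seen so far form a complete document.
            segments.append((start, i, "document"))
            start = i + 1
        elif score > low:
            # Uncertain boundary: cut anyway and flag the pages as a fragment.
            segments.append((start, i, "fragment"))
            start = i + 1
        # Otherwise: continuity, both pages stay in the same segment.
    if n_pages > 0:
        segments.append((start, n_pages - 1, "document"))
    return segments


if __name__ == "__main__":
    print(segment_stream(5, [0.1, 0.9, 0.5, 0.2]))
```

    Cutting on every rejection yields more segments than there are documents, which is the over-segmentation mentioned in the abstract; a later stage would have to decide whether a fragment is merged with its neighbour.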

    Handwritten/printed text separation Using pseudo-lines for contextual re-labeling

    This paper addresses the problem of separating machine-printed and handwritten text in real noisy documents. In previous work, we proposed a robust separation system relying on a proximity string segmentation algorithm. The extracted pseudo-lines and pseudo-words are used as basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel first assigns an appropriate label to each pseudo-word. Then, the local neighborhood of each pseudo-word is studied in order to propagate the context and correct the classification errors. In this work, we first propose to model the separation problem with conditional random fields that consider the horizontal neighborhood. As this neighborhood is too local to solve certain error cases, we have enhanced the method by using a more global context based on class dominance in the pseudo-line. The method has been evaluated on business documents. It separates handwritten and printed text with better scores (99.1% and 99.2% respectively) than noise (90.1%), which is very random in these documents.
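
    The idea of class dominance in a pseudo-line can be illustrated with a small sketch. It covers only the global relabeling step, not the conditional random field, and it assumes a simple rule that the abstract does not specify: if one class accounts for at least a share `min_share` (a hypothetical parameter) of the pseudo-words in a pseudo-line, every pseudo-word in that line takes that class.

```python
from collections import Counter


def relabel_by_dominance(labels, min_share=0.75):
    """Relabel the pseudo-words of one pseudo-line with its dominant class.

    labels is the list of per-pseudo-word labels from a first classifier.
    The line is left unchanged when no class reaches min_share.
    """
    if not labels:
        return []
    dominant, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_share:
        return [dominant] * len(labels)
    return list(labels)


if __name__ == "__main__":
    line = ["printed", "printed", "handwritten", "printed"]
    print(relabel_by_dominance(line))
```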

    Localisation automatique de champs de saisie sur des images de formulaires couleur par isomorphisme de sous-graphe

    This paper presents an approach for spotting textual fields in colored forms. We locate these fields through their neighboring context, which is modeled with a structural representation. First, informative zones are extracted. Second, forms are represented by graphs in which nodes represent rectangles of uniform color and edges represent neighboring links. Finally, the context of the queried region of interest (ROI) is also modeled as a graph. Subgraph isomorphism is applied in order to locate this ROI in the structural representation of a whole document. Evaluated on a dataset of 130 document images, the experimental results show that our approach is efficient and that the requested information is found even if its position changes.

    One-shot field spotting on colored forms using subgraph isomorphism

    This paper presents an approach for spotting textual fields in commercial and administrative colored forms. We locate these fields through their neighboring context, which is modeled with a structural representation. First, informative zones are extracted. Second, forms are represented by graphs. In these graphs, nodes represent colored rectangular shapes while edges represent neighboring relations. Finally, the neighboring context of the queried region of interest (ROI) is modeled as a graph. Subgraph isomorphism is applied in order to locate this ROI in the structural representation of a whole document. Evaluated on a dataset of 130 document images, the experimental results show that our approach is efficient and that the requested information is found even if its position changes.
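
    To make the matching step concrete, here is a minimal backtracking sketch of induced subgraph isomorphism on small labeled graphs. It is not the authors' algorithm: the node labels stand in for rectangle colors, edges are undirected neighboring links, and the function name and data layout are assumptions chosen for the example.

```python
def find_subgraph(query_labels, query_edges, doc_labels, doc_edges):
    """Return a dict mapping query nodes to document nodes, or None.

    query_labels / doc_labels map each node to a label (for example a color).
    query_edges / doc_edges are iterables of undirected (u, v) pairs.
    """
    q_adj = {frozenset(e) for e in query_edges}
    d_adj = {frozenset(e) for e in doc_edges}
    q_nodes = list(query_labels)

    def consistent(q, d, mapping):
        if query_labels[q] != doc_labels[d]:
            return False
        for q2, d2 in mapping.items():
            # Induced matching: adjacency must agree in both graphs.
            if (frozenset((q, q2)) in q_adj) != (frozenset((d, d2)) in d_adj):
                return False
        return True

    def extend(index, mapping):
        if index == len(q_nodes):
            return dict(mapping)
        q = q_nodes[index]
        for d in doc_labels:
            if d not in mapping.values() and consistent(q, d, mapping):
                mapping[q] = d
                result = extend(index + 1, mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack and try the next candidate
        return None

    return extend(0, {})


if __name__ == "__main__":
    query = ({"a": "red", "b": "blue"}, [("a", "b")])
    doc = ({1: "green", 2: "red", 3: "blue", 4: "blue"}, [(1, 2), (2, 3)])
    print(find_subgraph(*query, *doc))
```

    Because the match depends only on labels and neighboring links, not on coordinates, the queried context can still be found when the field moves within the form. The search is exponential in the worst case, so this sketch is only suitable for small graphs.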